This is the central research question of our project. It also represents the predictive portion of the project.
We will begin with some exploratory analysis. Summary statistics could provide a good basis for this; however, given the emphasis of this course, we will create exploratory visualizations to begin to understand our data.
#Exploratory Graph On Time Graduation Rates by Year*University
allRaceallGender<-fullData[fullData$race=='A' & fullData$gender=='B',]
onTimeGradbyUFigure <- plot_ly() %>% add_trace(x=allRaceallGender$year,y=allRaceallGender$grad_100_rate,type='scatter',color=allRaceallGender$chronname,mode='lines') %>% layout(title="On-time Graduation Rates by Year and University")
onTimeGradbyUFigure
We will use ordinary least squares linear regression to generate predictions of cohort graduation rates across the SEC universities over time. Because our response (on-time graduation rate by student body) is continuous and our inputs are a mix of categorical and numeric variables, linear regression is a reasonable choice for exploring possible relationships in the data. We have also examined the normality of our data and find it acceptably normal to satisfy the underlying assumption. Our covariates of interest will be:
Additionally, linear regression is an obvious choice to satisfy the project's need for a predictive type of analysis.
Variables will be selected using stepwise regression and adjusted R-squared. We acknowledge the limitations and bias of this method, but given the course’s focus on visualization as opposed to advanced analysis, it will likely produce acceptable research results.
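A minimal sketch of forward selection driven by adjusted R-squared follows. It is an illustration under stated assumptions, not the project's final procedure: the data are simulated, and the names `demo`, `adj_r2`, and `forward_adj_r2` are hypothetical. Note that base R's `step()` selects by AIC rather than adjusted R-squared, so the loop is written out by hand here.

```r
# Forward selection: at each step add the candidate term that gives the
# largest adjusted R-squared, and stop when no term improves it.
# All data below are simulated for illustration only.
set.seed(1)
n <- 200
demo <- data.frame(
  wins   = rpois(n, 7),                                        # hypothetical season wins
  school = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  noise  = rnorm(n)                                            # unrelated to the response
)
demo$grad_rate <- 40 + 1.5 * demo$wins + 5 * (demo$school == "B") + rnorm(n, sd = 3)

# Adjusted R-squared of a model with the given right-hand-side terms.
adj_r2 <- function(terms, data) {
  rhs <- if (length(terms) == 0) "1" else paste(terms, collapse = " + ")
  summary(lm(as.formula(paste("grad_rate ~", rhs)), data = data))$adj.r.squared
}

forward_adj_r2 <- function(candidates, data) {
  selected <- character(0)
  best <- adj_r2(selected, data)
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    scores <- sapply(remaining, function(term) adj_r2(c(selected, term), data))
    if (max(scores) <= best) break          # no candidate improves the fit
    selected <- c(selected, names(which.max(scores)))
    best <- max(scores)
  }
  list(terms = selected, adj_r2 = best)
}

result <- forward_adj_r2(c("wins", "school", "noise"), demo)
result$terms
```

Because adjusted R-squared rises whenever an added term has a |t| statistic above roughly 1, this criterion can admit weak predictors, which is one of the biases acknowledged above.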
We will validate this methodology by examining the model's underlying assumptions, particularly constant variance of the errors and independence of observations. We will produce diagnostic charts like those below to confirm that this is a reasonable analytic methodology.
fullData<-fullData[fullData$race!='A' & fullData$gender!='B',]
fullData$raceAndGender <- paste(fullData$race,fullData$gender)
lm3 <- lm(grad_100_rate ~ total.wins + chronname + raceAndGender, data = fullData)
#Here we will suppress the lm3 summary results in order to limit the document size.
#summary(lm3)
lm3ResidualsUniversity <- plot_ly() %>% add_trace(x=fitted(lm3),y=residuals(lm3),type='scatter',mode='markers', color=fullData$chronname) %>% layout(title="Model Residuals by University")
lm3ResidualsUniversity
lm3ResidualsRandG <- plot_ly() %>% add_trace(x=fitted(lm3),y=residuals(lm3),type='scatter',mode='markers', color=fullData$raceAndGender) %>% layout(title="Model Residuals by Race and Gender")
lm3ResidualsRandG
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(lm3)
par(mfrow=c(1,1)) # Change back to 1 x 1
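The diagnostic plots can be complemented with numeric checks. The sketch below is one possible approach, assuming simulated data in place of the project's model; the object names `fit`, `aux`, `bp_stat`, and `bp_p` are hypothetical. It applies a Shapiro-Wilk test to the residuals and a hand-written version of the studentized Breusch-Pagan idea for constant variance.

```r
# Simulated stand-in for a fitted model; not the project's data.
set.seed(2)
n <- 300
x <- runif(n, 0, 12)
y <- 30 + 2 * x + rnorm(n, sd = 4)
fit <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk test (null hypothesis: normal).
normality <- shapiro.test(residuals(fit))

# Constant variance: regress squared residuals on the fitted values.
# n * R-squared from this auxiliary regression is compared to a
# chi-squared distribution with 1 degree of freedom.
aux <- lm(residuals(fit)^2 ~ fitted(fit))
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = normality$p.value, bp_p = bp_p)
```

Small p-values would suggest a departure from normality or from constant variance, respectively; large ones only mean the test found no evidence against the assumption.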
In this case we can add an interaction effect to the regression model. To do this, we widen the data and use “dummy variables” to indicate each university. We can then multiply each university's indicator by its record to estimate any interaction effect. Race and gender are also widened in order to explore intersectional effects in a similar way.
lm2 <- lm(widerData$grad_100_rate ~ widerData$total.wins + widerData$Alabama + widerData$Arkansas + widerData$Auburn + widerData$Florida + widerData$Georgia + widerData$Kentucky + widerData$LSU + widerData$Mississippi.State + widerData$Missouri + widerData$Ole.Miss + widerData$South.Carolina + widerData$Tennessee + widerData$Texas.A.M + widerData$Vanderbilt + widerData$M.H + widerData$F.H + widerData$F.X + widerData$F.Ai + widerData$F.B + widerData$F.W + widerData$M.B + widerData$M.W + widerData$M.X + widerData$M.Ai)
#The summary is again suppressed in order to keep the report to a reasonable length.
#summary(lm2)
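The widening step itself is not shown above. The following is a minimal sketch of how it might be done, assuming a long-format data frame with a factor column named `chronname`. The data frame names `long` and `wider`, the `.x.wins` interaction column names, and all values are hypothetical. It also shows that R's built-in factor handling yields the same fit as the hand-built indicator columns.

```r
# Hypothetical long-format data; values are simulated.
set.seed(3)
long <- data.frame(
  chronname  = factor(rep(c("Alabama", "Auburn", "Florida"), each = 20)),
  total.wins = rpois(60, 8)
)
offsets <- c(Alabama = 0, Auburn = 4, Florida = 9)
long$grad_100_rate <- 35 + 0.8 * long$total.wins +
  unname(offsets[as.character(long$chronname)]) + rnorm(60)

# One 0/1 indicator column per university ("- 1" keeps every level).
dummies <- model.matrix(~ chronname - 1, data = long)
colnames(dummies) <- levels(long$chronname)
wider <- cbind(long, dummies)

# Interaction columns: each indicator multiplied by the win total.
interactions <- dummies * long$total.wins
colnames(interactions) <- paste0(levels(long$chronname), ".x.wins")
wider <- cbind(wider, interactions)

# The same model using a factor; lm() builds the indicators itself.
fit_factor <- lm(grad_100_rate ~ total.wins * chronname, data = long)
# The hand-widened version, dropping Alabama as the reference level.
fit_wide <- lm(grad_100_rate ~ total.wins + Auburn + Florida +
                 Auburn.x.wins + Florida.x.wins, data = wider)

all.equal(unname(fitted(fit_factor)), unname(fitted(fit_wide)))
```

One indicator has to be left out (or is dropped by `lm()` with an `NA` coefficient) when an intercept is present, because the full set of indicators sums to the intercept column.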
As this is a visualization course we can also explore these relationships graphically. Below is one simple example, a scatter plot. This is an appropriate
#Exploratory Graph On Time Graduation Rates by Football Record*University
onTimeGradbyRecord <- plot_ly() %>% add_trace(x=(allRaceallGender$total.wins/allRaceallGender$total.games),y=allRaceallGender$grad_100_rate,type='scatter',color=allRaceallGender$chronname,mode='markers') %>% layout(title="On-time Graduation Rates by Football Record and University")
onTimeGradbyRecord
Similar to university-level impacts, we are interested in whether demographics play a role in how a football team’s record could influence on-time graduation. We can perform an analysis nearly identical to the one at the university level.
lm1 <- lm(widerData$grad_100_rate ~ widerData$total.wins + widerData$Alabama + widerData$Arkansas + widerData$Auburn + widerData$Florida + widerData$Georgia + widerData$Kentucky + widerData$LSU + widerData$Mississippi.State + widerData$Missouri + widerData$Ole.Miss + widerData$South.Carolina + widerData$Tennessee + widerData$Texas.A.M + widerData$Vanderbilt + widerData$M.H + widerData$F.H + widerData$F.X + widerData$F.Ai + widerData$F.B + widerData$F.W + widerData$M.B + widerData$M.W + widerData$M.X + widerData$M.Ai + (widerData$total.wins * (widerData$M.H + widerData$F.H + widerData$F.X + widerData$F.Ai + widerData$F.B + widerData$F.W + widerData$M.B + widerData$M.W + widerData$M.X + widerData$M.Ai)))
#summary(lm1)
#Exploratory Graph On Time Graduation Rates by Football Record*Demographics
onTimeGradbyRecord <- plot_ly() %>% add_trace(x=(fullData$total.wins/fullData$total.games),y=fullData$grad_100_rate,type='scatter',color=fullData$raceAndGender,mode='markers') %>% layout(title="On-time Graduation Rates by Football Record and Demographics")
onTimeGradbyRecord
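One way to judge whether record-by-demographic interaction terms add anything over a main-effects model is a nested-model F-test. The sketch below is an illustration on simulated data, not a result from this project; the data frame `sim` and the model names `main_only` and `with_inter` are hypothetical, and the group labels are made up.

```r
set.seed(4)
n <- 240
sim <- data.frame(
  total.wins    = rpois(n, 8),
  raceAndGender = factor(sample(c("B F", "B M", "W F", "W M"), n, replace = TRUE))
)
# Simulated response in which the slope on wins differs for one group.
sim$grad_100_rate <- 40 + 1 * sim$total.wins +
  2 * sim$total.wins * (sim$raceAndGender == "B M") + rnorm(n, sd = 3)

# Main effects only versus main effects plus wins-by-group interactions.
main_only  <- lm(grad_100_rate ~ total.wins + raceAndGender, data = sim)
with_inter <- lm(grad_100_rate ~ total.wins * raceAndGender, data = sim)

# F-test of the null hypothesis that all interaction coefficients are zero.
comparison <- anova(main_only, with_inter)
comparison
```

A small p-value in the second row of the table indicates that allowing the slope on wins to vary by group improves the fit beyond what chance would explain.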
We are also considering the use of a violin plot. A violin plot makes sense because we have multiple observations across many intersectional categories of race and university, along with the binary categorical variable of gender.
A heat map over two of the three categorical variables (race, gender, and university) may also be appropriate, since there is a single continuous numeric dependent variable to map to color.
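A minimal sketch of such a heat map using base R graphics is shown here, assuming simulated rates. The data frame `grid` and the matrix `cell_means` are hypothetical names, and the choice to average over years within each cell is an assumption made for illustration.

```r
set.seed(5)
grid <- expand.grid(
  chronname = c("Alabama", "Auburn", "Florida", "Georgia"),
  race      = c("Ai", "B", "H", "W", "X"),
  year      = 2005:2012,
  stringsAsFactors = FALSE
)
grid$grad_100_rate <- runif(nrow(grid), 20, 70)   # simulated rates

# Mean rate for each university x race cell (averaging over years).
cell_means <- tapply(grid$grad_100_rate, list(grid$chronname, grid$race), mean)

# Draw without clustering or row scaling so colors map directly to rates.
heatmap(cell_means, Rowv = NA, Colv = NA, scale = "none",
        xlab = "Race", ylab = "University")
```

Setting `Rowv = NA` and `Colv = NA` suppresses the default dendrogram reordering, which keeps universities and race categories in a predictable order.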
To investigate multi-year impacts, previous years’ records will be appended to the dataset and the regression will be run multiple times, using a stepwise regression algorithm and adjusted R-squared. An interactive feature could allow a user to test different lag lengths for enduring impact and return the strength of the corresponding model.
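The lagging step could look something like the sketch below. It is a simplified illustration on simulated data: it fits one model per lag rather than running a full stepwise search, and the names `panel`, `add_lag`, `wins_lag1` to `wins_lag3`, and `lag_strength` are hypothetical.

```r
set.seed(6)
years <- 2000:2019
panel <- do.call(rbind, lapply(c("Alabama", "Auburn", "Florida"), function(u) {
  data.frame(chronname = u, year = years, total.wins = rpois(length(years), 8))
}))

# Append the win total from `k` seasons earlier within each university.
add_lag <- function(data, k) {
  data <- data[order(data$chronname, data$year), ]
  data[[paste0("wins_lag", k)]] <- ave(
    data$total.wins, data$chronname,
    FUN = function(w) c(rep(NA, k), head(w, -k))
  )
  data
}
for (k in 1:3) panel <- add_lag(panel, k)

# Simulated response that depends on wins two seasons earlier.
panel$grad_100_rate <- 40 + 1.5 * panel$wins_lag2 + rnorm(nrow(panel), sd = 2)

# Fit one model per lag on the same complete rows so that the
# adjusted R-squared values are comparable.
complete <- na.omit(panel)
lag_strength <- sapply(1:3, function(k) {
  f <- as.formula(paste0("grad_100_rate ~ wins_lag", k, " + chronname"))
  summary(lm(f, data = complete))$adj.r.squared
})
names(lag_strength) <- paste0("lag", 1:3)
lag_strength
```

A vector like `lag_strength` is the kind of value an interactive control could return as the user changes the lag.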
To investigate any autoregressive or moving-average time series effects, we will use an ARIMA-type model. Some diagnostic plots associated with the ARIMA model may be used, but these are generally visually uninteresting for the casual observer. ARIMA can be applied before the regression, with regression used afterward, or it can be applied to the regression residuals. For interpretability, we will apply the auto.arima() function to the residuals of our regression in order to detect time series effects.
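The residual-based approach can be sketched with base R alone. The example below is a stand-in, not the `auto.arima()` call itself: it searches a small grid of ARMA orders with `stats::arima()` and keeps the lowest-AIC fit, which is a simplified version of what an automatic order search does. The series is simulated, and the names `res`, `orders`, `aics`, and `best_order` are hypothetical.

```r
set.seed(7)
n <- 120
time_index <- seq_len(n)
# Simulated series: linear trend plus AR(1) errors (coefficient 0.6).
errors <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
y <- 50 + 0.1 * time_index + errors
fit <- lm(y ~ time_index)
res <- residuals(fit)

# Small grid of ARMA(p, q) orders; keep the fit with the lowest AIC.
orders <- expand.grid(p = 0:2, q = 0:2)
aics <- apply(orders, 1, function(o) {
  tryCatch(
    arima(res, order = c(o[["p"]], 0, o[["q"]]), include.mean = FALSE)$aic,
    error = function(e) NA_real_   # skip orders that fail to converge
  )
})
best_order <- orders[which.min(aics), ]

# Ljung-Box test for leftover autocorrelation in the regression residuals.
lb <- Box.test(res, lag = 10, type = "Ljung-Box")
best_order
lb$p.value
```

If the selected order is (0, 0) and the Ljung-Box test does not reject, the regression residuals show little evidence of a time series effect.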
As a visualization, we plan to incorporate a parallel coordinate chart that displays the model residuals over several years. This will be an appropriate visualization because periodicity 1 time series will tend to have residuals that are nearly parallel across time.
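A rough sketch of such a chart in base R follows, assuming residuals have been arranged with one row per series and one column per year. The matrix `res_wide`, the series labels, and the values are all simulated and hypothetical. It uses a shared residual scale across years rather than rescaling each axis separately, which is a design assumption made here so that residual magnitudes stay comparable.

```r
set.seed(8)
years  <- 2008:2013
series <- paste("series", 1:12)   # e.g. one university/demographic pair each
# Simulated residuals: rows are series, columns are years.
res_wide <- matrix(rnorm(length(series) * length(years)),
                   nrow = length(series),
                   dimnames = list(series, as.character(years)))

# One vertical axis per year on a shared residual scale; one line per series.
matplot(years, t(res_wide), type = "l", lty = 1,
        xlab = "Year", ylab = "Model residual",
        main = "Residuals by year (simulated)")
abline(v = years, col = "grey80")   # mark each year's axis
abline(h = 0, lty = 2)              # zero-residual reference line
```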